fix(backend): bound first-file cache wait in /v1/sync/audio urls endpoint (#7325) by kodjima33 · Pull Request #7815 · BasedHardware/omi

kodjima33 · 2026-06-11T06:14:36Z

Bug

#7325 — audio download endpoints hang with 0 bytes (90–300s, then timeout). Confirmed live in prod LB logs: GET /v1/sync/audio/{id}/urls repeatedly took 60–115s on 2026-06-10 alone; anything slower hits the 120s TimeoutMiddleware / client timeouts and surfaces as the reported hang.

Root cause

get_audio_signed_urls_endpoint merges the first uncached audio file synchronously in the request thread ("cache synchronously for immediate playback", routers/sync.py). For large recordings (hundreds of chunks) or under GCS pushback, download_audio_chunks_and_merge takes minutes, so the request blocks with zero bytes sent.

(The 429s the reporter saw are produced at the edge — those requests never reach backend pods, no app-level rate limit exists on this path — so the limiter half of the issue is infra config, out of scope here.)

Fix

Bound the wait: submit _precache_audio_file to sync_executor (not storage_executor — its chunk-download children run there; deadlock rule 3) and wait at most FIRST_FILE_CACHE_WAIT_SECONDS = 15s. Healthy merges (0.1–5s per prod logs) behave exactly as before. On timeout the merge keeps running and lands in the GCS cache; the file is returned as pending with a warning log.

Clients already handle pending: the Flutter player falls back to the direct stream URL, and the web detail panel precaches + refetches. A fast pending beats a 60–115s block or an opaque 504.

Tests

tests/unit/test_sync_urls_bounded_wait.py (registered in test.sh): structural guards (bounded result(timeout=), no inline merge, coordinator not on storage_executor, budget < middleware timeout) + logic tests of the wait/timeout primitive (timed-out merge still completes in background). All 6 verified locally via stdlib harness (pytest unavailable locally; CI runs the suite). py_compile, lint_async_blockers, black clean. UNVERIFIED at runtime against a live large-conversation merge.

🤖 automated by hourly watchdog; opened for review, not merged.

Fixes #7325

…oint (#7325)

…int (#7325)

greptile-apps · 2026-06-11T06:21:03Z

Greptile Summary

This PR fixes a production hang on GET /v1/sync/audio/{id}/urls where large first-file audio merges (60–115s) exceeded both client timeouts and the 120s middleware limit. The fix submits _precache_audio_file to sync_executor (not storage_executor, to prevent deadlock since its chunk-download children run there), waits at most 15 s via Future.result(timeout=), and returns the file as pending on timeout while the merge finishes in the background.

backend/routers/sync.py: Replaces the unbounded inline _precache_audio_file call with a bounded sync_executor-backed future and a FuturesTimeoutError catch; introduces FIRST_FILE_CACHE_WAIT_SECONDS = 15.0.
backend/tests/unit/test_sync_urls_bounded_wait.py: New test file with structural guards (bounded wait, no inline call, correct pool, budget < 120s) and a ThreadPoolExecutor-based proof that Future.result(timeout=) leaves the background task running to completion after timing out.

Confidence Score: 4/5

The fix correctly bounds the first-file wait to 15 s and the executor choice avoids deadlock. Safe to merge, with two minor gaps worth addressing before heavy production use.

The core logic is sound: sync_executor is the right pool (its chunk-download children run on storage_executor), the 15s budget is well under the 120s middleware timeout, and the pending fallback is already handled by clients. Two issues reduce confidence slightly: the except FuturesTimeoutError block does not catch concurrent.futures.CancelledError, so a queued-but-cancelled future during shutdown produces a 500 instead of a graceful pending response; and the structural test only checks != storage_executor (negative), so any other wrong executor choice would go undetected.

backend/routers/sync.py (exception handling around lines 297-303) and backend/tests/unit/test_sync_urls_bounded_wait.py (pool assertion at lines 64-66)

Important Files Changed

Filename	Overview
backend/routers/sync.py	Core fix: replaces unbounded inline _precache_audio_file with a sync_executor-backed future and 15s timeout; logic is correct and executor choice avoids deadlock, but CancelledError from future cancellation is not handled alongside FuturesTimeoutError.
backend/tests/unit/test_sync_urls_bounded_wait.py	Good structural and primitive tests; the pool assertion in test_coordinator_not_on_storage_executor only checks the negative case (!= storage_executor) and would pass for any other (incorrect) executor choice.
backend/test.sh	Registers the new test file in the test suite; no issues.

_{Reviews (1): Last reviewed commit: "test(backend): register test_sync_urls_b..." | Re-trigger Greptile}

greptile-apps · 2026-06-11T06:21:07Z

+        assert (
+            m.group(1) != 'storage_executor'
+        ), "coordinator must not share storage_executor with its chunk-download children"


The assertion only guards the negative case. If a future refactor changes the pool to postprocess_executor, db_executor, or any other executor, this test still passes — including pools whose workers could deadlock or whose purpose clashes with audio-merge work. Asserting positively on sync_executor catches any such drift.

Suggested change

assert (

m.group(1) != 'storage_executor'

), "coordinator must not share storage_executor with its chunk-download children"

assert (

m.group(1) == 'sync_executor'

), "coordinator must run on sync_executor (not storage_executor, to avoid deadlock with its children)"

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-11T06:21:08Z

+                try:
+                    first_file_future.result(timeout=FIRST_FILE_CACHE_WAIT_SECONDS)
+                except FuturesTimeoutError:
+                    logger.warning(
+                        f"sync_urls first-file cache exceeded {FIRST_FILE_CACHE_WAIT_SECONDS}s, returning pending "
+                        f"uid={uid} convo={conversation_id} file={audio_file_id}"
+                    )


CancelledError not caught alongside FuturesTimeoutError

future.result() can also raise concurrent.futures.CancelledError — a separate exception class from FuturesTimeoutError — when the future is cancelled before it starts executing. This happens when sync_executor is at capacity and shutdown_executors(cancel_futures=True) fires during a graceful shutdown: the queued task is cancelled and result() raises rather than blocking. The exception propagates unhandled through the endpoint and becomes a 500 instead of the graceful pending response that would be correct here. Catching (FuturesTimeoutError, CancelledError) in the same handler would close the gap.

kodjima33 added 3 commits June 11, 2026 02:14

fix(backend): bound first-file cache wait in /v1/sync/audio urls endp…

f4aefdf

…oint (#7325)

test(backend): cover bounded first-file cache wait in sync urls endpo…

73888b9

…int (#7325)

test(backend): register test_sync_urls_bounded_wait in test.sh

d507b00

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

mdmohsin7 mentioned this pull request Jun 12, 2026

Sync infra changes — May 28: storage pool, autoscaling, and app-side rate-limit fixes #7531

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(backend): bound first-file cache wait in /v1/sync/audio urls endpoint (#7325)#7815

fix(backend): bound first-file cache wait in /v1/sync/audio urls endpoint (#7325)#7815
kodjima33 wants to merge 3 commits into
mainfrom
watchdog/issue-7325-urls-bounded-wait

kodjima33 commented Jun 11, 2026

Uh oh!

greptile-apps Bot commented Jun 11, 2026

Uh oh!

greptile-apps Bot Jun 11, 2026

Uh oh!

greptile-apps Bot Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kodjima33 commented Jun 11, 2026

Bug

Root cause

Fix

Tests

Uh oh!

greptile-apps Bot commented Jun 11, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant